Container Monitoring: Beyond cAdvisor

Screen showing performance graphs for multiple services and metrics

cAdvisor was the tool that democratised container monitoring, and it remains relevant: the kubelet still embeds it. But in 2024, observing containers well requires more layers: cluster-state metrics, eBPF for deep visibility, APM for application context. This article maps out what to combine and how.

The Modern Minimum Stack

For serious Kubernetes in 2024:

  • kubelet / cAdvisor: CPU, memory, network, disk metrics per container.
  • kube-state-metrics: state of Deployments, Pods, ReplicaSets, HPA.
  • node-exporter: node metrics.
  • Prometheus: scrapes and aggregates all of the above.
  • Grafana: visualisation.

This covers 80% of what you need to monitor, and it is a solid OSS stack.
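As a sketch, the scrape layer of that base stack might look like the following raw Prometheus config. Most installs get this generated by kube-prometheus-stack; the paths and service names here are illustrative, not canonical:

```yaml
scrape_configs:
  # Per-container metrics from cAdvisor, embedded in each kubelet
  - job_name: cadvisor
    kubernetes_sd_configs:
      - role: node
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  # Cluster object state (Deployments, Pods, HPA...)
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```

node-exporter would be a third job, typically discovered via endpoints; Grafana then points at Prometheus as a data source.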

What’s Missing Without eBPF

cAdvisor gives “surface” metrics:

  • CPU usage %.
  • Memory RSS.
  • Network bytes.
  • Disk I/O.

But not:

  • Syscall latency: is the pod stuck on I/O?
  • Network latency between specific pods.
  • CPU profiles: which functions are consuming cycles.
  • Function-level detail: hot paths.

For this, eBPF is the modern tool.

eBPF: The Game-Changing Layer

Modern eBPF tools:

Pixie

Pixie (CNCF sandbox, originally from New Relic):

  • Auto-instruments HTTP/gRPC/DNS without sidecars or code changes.
  • Live flame graphs.
  • Automatic service map.
  • PxL-language queries.

One per-node eBPF agent + web UI. Developer-friendly.
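For a flavour of PxL, this is roughly the hello-world from Pixie's docs (the http_events table is standard, though column details vary by version):

```pxl
# Show the last five minutes of HTTP traffic Pixie captured, no instrumentation needed
df = px.DataFrame(table='http_events', start_time='-5m')
px.display(df)
```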

Grafana Beyla

Beyla:

  • Auto-instrumentation for Go, Java, Node apps.
  • Generates OpenTelemetry traces without code modification.
  • Grafana stack integration.

Simpler than Pixie, focused on traces/metrics.
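A minimal sketch of how Beyla attaches, as a container spec fragment. The env var names below match Beyla's docs as I recall them, so verify against the current release:

```yaml
- name: beyla
  image: grafana/beyla:latest
  securityContext:
    privileged: true              # eBPF probes need elevated privileges
  env:
    - name: BEYLA_OPEN_PORT       # instrument whatever process listens on this port
      value: "8080"
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://otel-collector:4318"
```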

Parca

Parca:

  • Continuous profiling of the whole cluster.
  • eBPF flame graphs.
  • Grafana-integrable.

Specific for CPU profiling.

Inspektor Gadget

A Kinvolk/Microsoft tool for eBPF-based debugging:

  • kubectl trace equivalents.
  • Per-pod network snapshots.
  • On-demand profiling.

APM: The Application Layer

eBPF gives infra visibility; APM gives application visibility:

  • OpenTelemetry: the open standard, increasingly adopted.
  • Jaeger / Grafana Tempo: trace backends.
  • Datadog / New Relic / Dynatrace: complete commercial suites.

With the OTel SDK, your app emits:

  • Request spans.
  • Business metrics.
  • Correlated logs.

Beyla can auto-generate some of this, but for business metrics you need the SDK.
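To make spans-plus-business-attributes concrete without pulling in a backend, here is a toy, stdlib-only sketch of the pattern the OTel SDK implements for you. The names are hypothetical, not the real OTel API:

```python
import time
import uuid
from contextlib import contextmanager

# Toy span recorder: a stand-in for what the OTel SDK automates; not the real API.
SPANS = []

@contextmanager
def span(name, **attributes):
    """Record a timed span with a trace id and arbitrary business attributes."""
    record = {"name": name, "trace_id": uuid.uuid4().hex, "attrs": attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# A request handler would wrap its work like this:
with span("process_order", order_value_eur=99.50):
    time.sleep(0.01)  # simulate work

print(SPANS[0]["name"], round(SPANS[0]["duration_ms"]))
```

The real SDK adds context propagation, exporters, and sampling; the point is that each span carries both timing and the business attributes that infra-level tools cannot see.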

Combining Without Saturating

A common mistake: running every tool at once, which means massive overhead. The typical sweet spot:

  • cAdvisor + kube-state-metrics + node-exporter: light base.
  • eBPF (Pixie or Beyla): add when needing deep visibility.
  • APM with OTel: for critical apps, not all.
  • Commercial APM: only with clear use case vs OSS.

Each layer should add distinct value. Duplicating is waste.

Essential Per-Container Metrics

Always monitor:

  • CPU throttling: is the pod rate-limited?
  • Memory working set: real use, not RSS.
  • OOM kills: key counter.
  • Network errors: TX/RX drops.
  • Disk pressure: fullness + I/O saturation.
  • Restart count: flapping = problem.

For Kubernetes, additionally:

  • Pod phase: Pending, Running, Failed.
  • Readiness probe failures.
  • HPA desired vs current.
  • PVC usage.
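Hedged PromQL sketches for a few of these (the metric names are the standard cAdvisor and kube-state-metrics ones; label sets vary per install):

```promql
# CPU throttling ratio per container
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])

# Working set, not RSS: what the OOM killer and the HPA actually look at
container_memory_working_set_bytes

# Restarts over the last hour (flapping detector)
increase(kube_pod_container_status_restarts_total[1h])
```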

Alerts Worth Having

Few but effective:

  • Pod restart > N in Y minutes: flapping.
  • Sustained CPU throttling > 50%: the CPU limit is too low.
  • OOM kills: always investigate.
  • Memory > 90% limit sustained: leak or sizing.
  • Node not ready > X minutes: incident.
  • HPA at max replicas for > Y min: capacity issue.

Fewer useful alerts > many ignored alerts.
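Two of these written as Prometheus alerting rules, as a sketch; the thresholds and labels are examples to adapt:

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: PodRestartFlapping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"

      - alert: SustainedCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} throttled over 50% for 15 minutes"
```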

Dashboards: What to Show

Typical levels:

  • Cluster overview: total resources, nodes, pods.
  • Namespace: per team/application.
  • Workload: specific deploy, pods, containers.
  • Pod detail: drill-down for troubleshooting.

The community Kubernetes dashboards on grafana.com (IDs 315, 6417, etc.) are good starting points.

Logging Integration

Metrics without logs are only half the picture. A typical stack:

  • Fluent Bit or Loki-native shippers for log collection.
  • Loki for storage + Grafana for visualization.
  • Trace correlation via trace IDs.

When investigating an incident, you need metrics + logs + traces on the same timeline.

Security Observability

Complementary:

  • Falco: eBPF runtime security.
  • Tracee (Aqua): similar, eBPF-based.
  • Kubernetes API audit logs.

Not standard “monitoring” but part of the complete picture.

What You Can Skip

  • Telegraf: still valid, but the Prometheus ecosystem is the default now.
  • Standalone InfluxDB: Prometheus (metrics) plus Loki (logs) cover the same ground.
  • Legacy Stackdriver: GCP-only, lock-in.
  • ELK for metrics: better to keep Elastic for logs alone.

A Practical Example

Typical 50-node, 500-pod cluster:

  • Prometheus federation: ~2000 targets, 5M series.
  • Retention: 30 days hot + object storage.
  • Grafana with 10-15 curated dashboards.
  • 20-30 useful alerts.
  • Beyla in some namespaces for traces.
  • Loki for logs.

An all-OSS stack, with roughly 5% of total cluster CPU/RAM as monitoring overhead.
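Those retention numbers can be sanity-checked with back-of-envelope arithmetic. The 30 s scrape interval and ~1.5 bytes per compressed sample are assumptions for the sketch, not figures from the cluster above:

```python
# Back-of-envelope Prometheus storage sizing for the example cluster.
ACTIVE_SERIES = 5_000_000     # active series, from the example above
SCRAPE_INTERVAL_S = 30        # assumed scrape interval
BYTES_PER_SAMPLE = 1.5        # assumed post-compression cost per sample
RETENTION_DAYS = 30           # hot retention, from the example above

samples_per_day = ACTIVE_SERIES * (86_400 // SCRAPE_INTERVAL_S)
daily_gb = samples_per_day * BYTES_PER_SAMPLE / 1e9
hot_retention_gb = daily_gb * RETENTION_DAYS

print(f"~{daily_gb:.1f} GB/day, ~{hot_retention_gb:.0f} GB for {RETENTION_DAYS} days hot")
```

Under these assumptions that is roughly 22 GB/day and ~650 GB of hot storage, which is why older data gets pushed to object storage.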

Conclusion

Monitoring containers well in 2024 requires more than cAdvisor. The OSS base (Prometheus + kube-state-metrics + node-exporter + Grafana) is solid and sufficient for most teams. eBPF (Pixie, Beyla, Parca) adds deep visibility when needed. APM with OpenTelemetry complements it with the application view. The trap is over-engineering: more tools mean more maintenance and more noise. Start with the solid base, add layers when a use case justifies them, and maintain alert discipline: fewer, well-thought-out alerts win.

Follow us on jacar.es for more on observability, Kubernetes, and eBPF.
